What is This Setup?
This configuration allows you to:
- Run LLMs locally on your own hardware with GPU acceleration
- Host models like Llama, Mistral, or GPT-OSS for private AI inference
- Achieve faster response times compared to CPU-only inference
- Maintain data privacy by keeping everything on your infrastructure
- Access your LLM via web UI similar to ChatGPT
What is GPU Passthrough?
GPU passthrough (also called PCIe passthrough) allows a virtual machine to directly access a physical GPU, bypassing the hypervisor layer. This means:
- Near-native performance: Your VM gets almost the same GPU performance as bare metal
- Direct hardware access: The VM controls the GPU as if it were physically installed
- Exclusive access: Only one VM can use the passed-through GPU at a time
- Required for GPU compute: Essential for running LLMs with GPU acceleration in VMs
Prerequisites
Before starting, you need:
- A Proxmox server with an NVIDIA GPU installed
- An Ubuntu Server VM (22.04 or later recommended)
- Docker installed on the VM
- Basic familiarity with Linux command line
- SSH access to your VM
- Sufficient VRAM on your GPU:
  - Small models (7B parameters): 8GB VRAM minimum
  - Medium models (13B-20B): 16GB+ VRAM
  - Large models (30B+): 24GB+ VRAM
Step 1: Configure GPU Passthrough in Proxmox
Follow this video tutorial to set up GPU passthrough from your Proxmox host to your Ubuntu VM: 📹 Proxmox GPU Passthrough Guide
The video covers:
- Enabling IOMMU in BIOS
- Configuring Proxmox for PCIe passthrough
- Adding the GPU to your VM
- Verifying the setup
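Once passthrough is in place, a quick sanity check inside the Ubuntu VM is to confirm the GPU shows up on the PCI bus. This check is a suggested extra, not part of the linked video:

```bash
# The NVIDIA GPU should appear as a PCI device inside the VM
lspci -nnk | grep -iA3 nvidia
```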
Step 2: Install NVIDIA Drivers
The NVIDIA drivers enable your Ubuntu system to communicate with the GPU hardware. Follow this guide for driver installation on Ubuntu: 📖 NVIDIA Driver Installation Guide
Quick verification after installation:
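For example, assuming the install completed and nvidia-smi is on your PATH:

```bash
# Should print the driver version, the CUDA version it supports, and your GPU model
nvidia-smi
```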
Step 3: Install CUDA Toolkit
CUDA is NVIDIA’s parallel computing platform required for GPU-accelerated applications. Download and install CUDA from the official source: 📦 CUDA Toolkit Downloads
Select your operating system, architecture, and distribution to get the appropriate installation commands.
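To confirm the toolkit itself landed, you can query the CUDA compiler. The path below assumes the default install location; adjust it (or add CUDA to your PATH) if you chose a different prefix:

```bash
# Print the installed CUDA toolkit version
/usr/local/cuda/bin/nvcc --version
```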
Step 4: NVIDIA Container Toolkit
Install the NVIDIA Container Toolkit by following the official installation guide to enable GPU access in Docker containers: 📖 NVIDIA Container Toolkit Installation
Verify GPU access in Docker with a quick test (see the sketch below); you should see the same nvidia-smi output as before, confirming Docker can access the GPU.
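A sketch of that test, assuming the toolkit is installed and Docker has been restarted. The base image is just an example; the toolkit injects nvidia-smi into the container at runtime:

```bash
# Run nvidia-smi inside a throwaway container to confirm GPU access
docker run --rm --gpus all ubuntu nvidia-smi
```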
Step 5: Deploy Ollama and Open WebUI
Create a directory for your setup and add a docker-compose.yml inside it; sketches of both follow.
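For example (the directory name is just an assumption; any location works):

```bash
# Create a working directory for the compose file
mkdir -p ~/ollama-webui
cd ~/ollama-webui
```

A minimal docker-compose.yml sketch based on the settings explained in the next two sections; the image tags, container names, and volume names are assumptions, so adapt them to your setup:

```yaml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"            # Ollama API endpoint
    environment:
      - OLLAMA_KEEP_ALIVE=-1     # keep models loaded in GPU memory indefinitely
    volumes:
      - ollama:/root/.ollama     # persist downloaded models between restarts
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "80:8080"                # web UI at http://your-vm-ip
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
      - ENABLE_ADMIN_CHAT_ACCESS=false
    extra_hosts:
      - "host.docker.internal:host-gateway"
    volumes:
      - open-webui:/app/backend/data   # user data, conversations, settings
    depends_on:
      - ollama

volumes:
  ollama:
  open-webui:
```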
Ollama Service
- OLLAMA_KEEP_ALIVE: -1: Keeps models loaded in GPU memory indefinitely for instant responses
- Port 11434: API endpoint for model inference
- Volume: Persists downloaded models between restarts
- GPU reservation: Ensures the container can access all available GPUs
Open WebUI Service
- Port 80: Web interface accessible at http://your-vm-ip
- ENABLE_ADMIN_CHAT_ACCESS: false: Prevents the admin user from accessing all chats (it's kinda creepy to check your employees' chats)
- host.docker.internal: Allows the web UI to communicate with Ollama
- Volume: Stores user data, conversations, and settings
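With the compose file in place, start the stack (assuming Docker Compose v2 is installed):

```bash
# Start both containers in the background and confirm they are running
docker compose up -d
docker compose ps
```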
Step 6: Access Open WebUI and Download Models
Open your web browser and navigate to http://your-vm-ip, then:
- Click on your profile icon in the top right
- Go to Admin Panel → Settings → Models
- In the “Pull a model from Ollama.com” field, enter a model name
- Click the download button
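If you prefer the terminal, you can also pull a model through the Ollama container; the container and model names below are just examples:

```bash
# Download a model directly via the Ollama CLI inside the container
docker exec -it ollama ollama pull llama3.1:8b
```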
Step 7: Verify GPU Acceleration
Check that your model is running on the GPU; a command sketch follows this list. In its output, look for:
- PROCESSOR: 100% GPU ✅ - Model is running on GPU (good!)
- PROCESSOR: 100% CPU ❌ - Model fell back to CPU
- UNTIL: Forever ✅ - Model stays loaded (due to OLLAMA_KEEP_ALIVE: -1)
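A sketch of the check, assuming the compose setup above with a container named ollama:

```bash
# ollama ps reports the PROCESSOR and UNTIL columns referenced above
docker exec -it ollama ollama ps
```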
Step 8: Customize Model Context Length
The context window determines how much text the model can remember in a conversation. Larger contexts allow for longer discussions but use more VRAM. Access Ollama’s interactive mode and adjust the parameter there (a session sketch follows this list); doing so:
- Sets context to 10,000 tokens
- Saves as a new model variant with the custom context
- The new model persists these settings permanently
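A sketch of that session; the model name and the name given to the saved variant are assumptions:

```bash
# Open an interactive session with a model you have already pulled
docker exec -it ollama ollama run llama3.1:8b

# Inside the Ollama prompt:
>>> /set parameter num_ctx 10000
>>> /save llama3.1:8b-10k
>>> /bye
```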
You can watch VRAM usage with nvidia-smi after changing the context length, since a larger context reserves more GPU memory.
Verify your custom model:
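For example (again assuming the container name from the compose sketch):

```bash
# The saved variant should appear alongside the base model
docker exec -it ollama ollama list
```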